Skip to content

Throwaway: conc-64 gsm8k eval for DEP8+MTP3 dispatch token bug#1659

Open
Oseltamivir wants to merge 18 commits into
mainfrom
dsr1-dep8-mtp3-conc64-eval
Open

Throwaway: conc-64 gsm8k eval for DEP8+MTP3 dispatch token bug#1659
Oseltamivir wants to merge 18 commits into
mainfrom
dsr1-dep8-mtp3-conc64-eval

Conversation

@Oseltamivir

@Oseltamivir Oseltamivir commented Jun 3, 2026

Copy link
Copy Markdown
Collaborator

Summary

  • Reproduce SGLANG_MORI_NUM_MAX_DISPATCH_TOKENS_PER_RANK < 256 silent corruption on DEP8+MTP3 disagg config
  • Narrows dsr1-fp4-mi355x-sglang-disagg-8k1k-mtp search-space to a single conc-64 DEP8+MTP3 entry
  • With max(CONC_LIST)=64, server computes dispatch_tokens = 64/8*4 = 32 (below 256 threshold → broken All2All kernel)
  • Expected result: ~0% gsm8k (silent corruption), confirming the perf pareto's artificially inflated acceptance lengths at low concurrency

Background

The previous eval (#1644) passed with 0.9431 gsm8k because the conc-list was [64,128,256,512,640]max=640dispatch_tokens=320 (≥256, correct kernel). The actual perf benchmarks run each concurrency point solo, so at conc=64 the value drops to 32 → broken kernel → garbage tokens + inflated AL.

Not for merge — throwaway validation only.


Note

Medium Risk
Mutates installed sglang at container start and changes MoRI sizing for all disagg runs using server_sglang.sh; benchmark correctness improves but behavior diverges from unpatched upstream until merged.

Overview
Adds a MoRI dispatch-token fix for low-concurrency DEP8+MTP3 disagg runs where tiny SGLANG_MORI_NUM_MAX_DISPATCH_TOKENS_PER_RANK values (e.g. conc-64 → 32) silently corrupt MoE output (gsm8k ≈ 0).

Harness / server: server_sglang.sh now runs apply_moriep_dispatch_floor.py before launch_server, which in-place patches the image’s vendor moriep.py to floor num_max_dispatch_tokens_per_rank at 256 (avoids a full-file overlay that breaks MoriEPDispatcher). The DP+EP decode path sets MORI_MAX_DISPATCH_TOKENS_DECODE to BENCH_MAX_CONC_VALUE (per DP rank), not max_conc / dp_ranks, and drops the old harness env clamp in favor of the sglang-side floor. patches/README.md and perf-changelog.yaml document the behavior.

CI config (marked throwaway): dsr1-fp4-mi355x-sglang-disagg-8k1k-mtp 8k1k MTP search-space is collapsed to a single conc-64, 1×DEP8 prefill/decode, MTP3 entry to reproduce dispatch=32 corruption on MI355X.

Reviewed by Cursor Bugbot for commit 1f386bd. Bugbot is set up for automated code reviews on this repo. Configure here.

@Oseltamivir Oseltamivir requested a review from a team June 3, 2026 18:43
@Oseltamivir Oseltamivir added the non-canary-full-sweep-enabled Run the full sweep without the canary gate (full search space, no trim) label Jun 3, 2026
@Oseltamivir Oseltamivir added non-canary-full-sweep-enabled Run the full sweep without the canary gate (full search space, no trim) and removed non-canary-full-sweep-enabled Run the full sweep without the canary gate (full search space, no trim) labels Jun 3, 2026
@Oseltamivir

Copy link
Copy Markdown
Collaborator Author

/sweep

@github-actions

github-actions Bot commented Jun 3, 2026

Copy link
Copy Markdown
Contributor

@Oseltamivir Kicking off a sweep.

Run: https://github.com/SemiAnalysisAI/InferenceX/actions/runs/26905833527
Command: ``
Pinned ref: 93c050e
Approval: not required (trusted collaborator).

…en corruption

Narrow dsr1-fp4-mi355x-sglang-disagg-8k1k-mtp search-space to a single
DEP8+MTP3 conc-64 entry. With max(CONC_LIST)=64, the server computes
SGLANG_MORI_NUM_MAX_DISPATCH_TOKENS_PER_RANK=32, which is below the 256
threshold that selects the correct All2All kernel. Expected: ~0% gsm8k
(silent corruption from the low-latency All2All variant).

Not for merge — throwaway validation of the dispatch token bug.
@Oseltamivir Oseltamivir force-pushed the dsr1-dep8-mtp3-conc64-eval branch from 93c050e to 45f69f5 Compare June 3, 2026 18:48
@github-actions

github-actions Bot commented Jun 3, 2026

Copy link
Copy Markdown
Contributor

Clamp MORI_MAX_DISPATCH_TOKENS_DECODE to minimum 256 when DP+EP are
both enabled, preventing SGLang's low-latency All2All kernel from being
selected. That kernel silently corrupts outputs at small buffer sizes.

Run A of A/B test: benchmark + eval WITH clamp on conc-64 DEP8+MTP3.
Comment thread benchmarks/multi_node/amd_utils/server_sglang.sh Outdated
@github-actions

github-actions Bot commented Jun 3, 2026

Copy link
Copy Markdown
Contributor

Run B of A/B test: benchmark + eval WITHOUT dispatch token clamp.
MORI_MAX_DISPATCH_TOKENS_DECODE will be 32 (<256 threshold).
Expected: corrupted output, inflated AL, ~0% gsm8k.
@github-actions

github-actions Bot commented Jun 3, 2026

Copy link
Copy Markdown
Contributor

… benchmark+eval at conc-64

Validates Option A: instead of clamping the env var, patch the installed
SGLang moriep.py at runtime to enforce a minimum of 64 (AMD CDNA3/4
warpSize) on num_max_dispatch_tokens_per_rank before it reaches the MoRI
kernel config. If gsm8k recovers (like the 256 clamp did), this confirms
warpSize is the minimum viable buffer floor and scopes the upstream fix.
Comment thread benchmarks/multi_node/amd_utils/server_sglang.sh Outdated
…val at conc-64

Root cause: the MoRI All2All dispatch kernel (EpDispatchInterNodeV1Kernel /
IntraNode) writes dispatched tokens into warpSize-aligned receive slots
(destTokId = flagSlotId*warpSize + laneId, laneId 0..63), so each warp-chunk
spans 64 (CDNA3/4 wavefront) token slots. The per-rank receive region is sized
to maxNumInpTokenPerRank, which the harness derives as max(CONC_LIST)/TP*(MTP+1).
At low concurrency this collapses below 64 (conc-64/TP8/MTP3 -> 64/8*4 = 32), so
a single chunk overruns the 32-slot region -> silent out-of-bounds writes ->
semantically corrupt output (decodes fine, gsm8k=0).

Confirmed from the conc-64 Run B decode log: INTER_KERNEL_SWITCH=16 with
DISPATCH_TOKENS=32 selects the NON-LL InterNodeV1 kernel (32 > 16), yet output
was still corrupt -> the bug is the buffer-size floor, not the LL-vs-non-LL
kernel choice.

Fix: clamp MORI_MAX_DISPATCH_TOKENS_DECODE to >= 64 after the MTP multiply.
Only raises the value at low conc; adds a few MB of staging buffer but no
compute, so real throughput is unchanged (the ~3% edge of the corrupt run was
an artifact of dropping work). 64 is the principled minimum vs the proven-but-
larger 256.
@Oseltamivir Oseltamivir added non-canary-full-sweep-enabled Run the full sweep without the canary gate (full search space, no trim) and removed non-canary-full-sweep-enabled Run the full sweep without the canary gate (full search space, no trim) labels Jun 3, 2026
@github-actions

github-actions Bot commented Jun 3, 2026

Copy link
Copy Markdown
Contributor

…ent)

The conc-64 run with the warpSize floor (64) still scored gsm8k=0.00
(run 26919517564), disproving the one-wavefront hypothesis. The per-rank
dispatch buffer must hold the routing fan-in (a receiving rank takes tokens
from all worldSize peers), not just one warp-chunk. Empirically on MI355X:
dispatch=32 -> 0.00, dispatch=64 -> 0.00, dispatch>=256 -> 0.94. Clamp to the
proven 256. Throughput is unchanged; the corrupt run's ~3% edge was dropped
work, not real speed.
@Oseltamivir Oseltamivir removed the non-canary-full-sweep-enabled Run the full sweep without the canary gate (full search space, no trim) label Jun 3, 2026
The vendor image installs sglang as a namespace package where
__file__ is None.  os.path.dirname(None) throws TypeError, so the
patcher crashed and the floor was never applied — eval ran unpatched.

Fall through __file__ → __path__ → importlib.util.find_spec() to
locate the package directory robustly.
@Oseltamivir Oseltamivir added non-canary-full-sweep-enabled Run the full sweep without the canary gate (full search space, no trim) and removed non-canary-full-sweep-enabled Run the full sweep without the canary gate (full search space, no trim) labels Jun 4, 2026
spec = importlib.util.find_spec("sglang")
if spec and spec.submodule_search_locations:
pkg_dir = list(spec.submodule_search_locations)[0]
except Exception:
@Oseltamivir Oseltamivir removed the non-canary-full-sweep-enabled Run the full sweep without the canary gate (full search space, no trim) label Jun 4, 2026
Comment thread benchmarks/multi_node/amd_utils/server_sglang.sh
@Oseltamivir Oseltamivir added the non-canary-full-sweep-enabled Run the full sweep without the canary gate (full search space, no trim) label Jun 4, 2026
@github-actions

github-actions Bot commented Jun 4, 2026

Copy link
Copy Markdown
Contributor

@Oseltamivir Oseltamivir force-pushed the dsr1-dep8-mtp3-conc64-eval branch from 66d701b to 9b50d69 Compare June 4, 2026 04:10
@SemiAnalysisAI SemiAnalysisAI deleted a comment from github-actions Bot Jun 4, 2026
@SemiAnalysisAI SemiAnalysisAI deleted a comment from github-actions Bot Jun 4, 2026
@SemiAnalysisAI SemiAnalysisAI deleted a comment from github-actions Bot Jun 4, 2026
@SemiAnalysisAI SemiAnalysisAI deleted a comment from github-actions Bot Jun 4, 2026
@SemiAnalysisAI SemiAnalysisAI deleted a comment from github-actions Bot Jun 4, 2026
@github-actions

github-actions Bot commented Jun 4, 2026

Copy link
Copy Markdown
Contributor

The vendor image installs sglang at /sgl-workspace/sglang/ but the
actual Python package is under python/sglang/ within that tree.
When __path__ returns the repo root, the patcher couldn't find
moriep.py. Add candidate paths and a bounded walk fallback.
@github-actions

github-actions Bot commented Jun 4, 2026

Copy link
Copy Markdown
Contributor

f"{indent} self.num_max_dispatch_tokens_per_rank, {FLOOR}\n"
f"{indent})\n"
)
lines.insert(end + 1, floor_block)

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unbalanced parens corrupt moriep insert

Low Severity

If parenthesis balancing never reaches depth <= 0 before EOF, end stays at start and the floor block is inserted on the line after the opening get_int_env_var(, which can splice Python inside a multi-line call and break moriep.py without reporting failure.

Fix in Cursor Fix in Web

Reviewed by Cursor Bugbot for commit 59dd6c3. Configure here.

@github-actions

github-actions Bot commented Jun 4, 2026

Copy link
Copy Markdown
Contributor

The formula divided BENCH_MAX_CONC_VALUE by decode_dp_ranks, assuming
--max-running-requests is a global limit split across DP ranks. It is
actually per-rank: each of the 8 DP schedulers independently allows up
to BENCH_MAX_CONC_VALUE requests. At conc-64/TP8/MTP3 the old formula
produced dispatch=32 (64/8*4), but each rank can hold 64*4=256 tokens,
causing 8x buffer overflow in MoRI's intra-node dispatch kernel (the
only guard is an assert compiled out under -DNDEBUG) and silent
corruption (gsm8k=0).

@cursor cursor Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cursor Bugbot has reviewed your changes and found 1 potential issue.

There are 3 total unresolved issues (including 2 from previous reviews).

Fix All in Cursor

❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.

Reviewed by Cursor Bugbot for commit 1fdb89f. Configure here.

Comment thread benchmarks/multi_node/amd_utils/server_sglang.sh
@github-actions

github-actions Bot commented Jun 4, 2026

Copy link
Copy Markdown
Contributor

@github-actions

github-actions Bot commented Jun 9, 2026

Copy link
Copy Markdown
Contributor

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

non-canary-full-sweep-enabled Run the full sweep without the canary gate (full search space, no trim)

Projects

Status: No status

Development

Successfully merging this pull request may close these issues.

1 participant